CPSC 330 Lecture 8: Hyperparameter Optimization

Varada Kolhatkar

Announcements

  • Important information about midterm 1
    • https://piazza.com/class/m01ukubppof625/post/249
  • Change of my office hours
    • Thursdays from 2 to 3 in my office ICCS 237
  • HW3 is due today at 11:59 pm.
  • HW4 has been released

Recap: CountVectorizer

  • Unlike most transformers, which take a DataFrame as input, CountVectorizer expects a single column of text (e.g., a pandas.Series).
  • DataFrame input: used by most transformers to handle multiple features at once.
  • Series input: CountVectorizer focuses on one text column at a time.
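A minimal sketch of this input convention (the toy texts below are made up for illustration — they are not the lecture's SMS data):

```python
# CountVectorizer takes a 1-D sequence of strings (a pandas Series or a
# plain list), not a DataFrame of multiple columns.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["URGENT! You have won a prize", "see you at lunch"]  # toy SMS-like texts
vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document
print(X.shape)  # (2, 8): 2 documents, 8 unique tokens of 2+ characters
```

Note that the default tokenizer lowercases and keeps only tokens of two or more word characters, which is why "a" does not appear in the vocabulary.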

Hyperparameter optimization motivation

Data

sms_df = pd.read_csv(DATA_DIR + "spam.csv", encoding="latin-1")
sms_df = sms_df.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]
train_df.head(4)
target sms
3130 spam LookAtMe!: Thanks for your purchase of a video...
106 ham Aight, I'll hit you up when I get some cash
4697 ham Don no da:)whats you plan?
856 ham Going to take your babe out ?

Model building

  • Let’s define a pipeline
pipe_svm = make_pipeline(CountVectorizer(), SVC())
  • Suppose we want to try out different hyperparameter values.
parameters = {
    "max_features": [100, 200, 400],  # CountVectorizer
    "gamma": [0.01, 0.1, 1.0],        # SVC
    "C": [0.01, 0.1, 1.0],            # SVC
}

Hyperparameter optimization with loops

  • Define a parameter space.
  • Iterate through possible combinations.
  • Evaluate model performance.
  • What are some limitations of this approach?
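The steps above can be sketched with nested loops over the parameter space (synthetic data stands in for the SMS set here, so the values are assumptions, not lecture results):

```python
# Hyperparameter search with explicit loops: define a space, iterate over
# combinations, evaluate each with cross-validation, keep the best.
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)

best_score, best_params = -1.0, None
for gamma, C in product([0.01, 0.1, 1.0], [0.01, 0.1, 1.0]):
    scores = cross_val_score(SVC(gamma=gamma, C=C), X, y, cv=5)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), {"gamma": gamma, "C": C}

print(best_params, best_score)
```

One limitation is already visible: every new hyperparameter adds another level of looping, and nothing here runs in parallel or records the full results table for later inspection.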

sklearn methods

  • sklearn provides two main methods for hyperparameter optimization
    • Grid Search
    • Random Search

Grid search example

from sklearn.model_selection import GridSearchCV

pipe_svm = make_pipeline(CountVectorizer(), SVC())

param_grid = {
    "countvectorizer__max_features": [100, 200, 400],
    "svc__gamma": [0.01, 0.1, 1.0],
    "svc__C": [0.01, 0.1, 1.0],
}
grid_search = GridSearchCV(pipe_svm, 
                  param_grid = param_grid, 
                  n_jobs=-1, 
                  return_train_score=True
                 )
grid_search.fit(X_train, y_train)
grid_search.best_score_
0.9782606272997375
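After fitting, the search object exposes more than just the best score. A self-contained sketch (the toy text data below is an assumption standing in for the SMS dataset):

```python
# Inspecting a fitted GridSearchCV: best_params_, best_score_, cv_results_.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X = pd.Series(["win cash now", "free prize", "meet for lunch",
               "see you soon", "claim your prize", "lunch at noon"] * 5)
y = pd.Series(["spam", "spam", "ham", "ham", "spam", "ham"] * 5)

pipe = make_pipeline(CountVectorizer(), SVC())
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1.0]}, cv=3, return_train_score=True)
grid.fit(X, y)

print(grid.best_params_)   # best hyperparameter combination found
print(grid.best_score_)    # mean cross-validation score of that combination
results = pd.DataFrame(grid.cv_results_)  # one row per combination tried
```

The `cv_results_` table is useful for checking whether the best values sit at the edge of the grid, which suggests the grid range should be widened.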

Random search example

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint, uniform

pipe_svc = make_pipeline(CountVectorizer(), SVC())

param_dist = {
    "countvectorizer__max_features": randint(100, 2000), 
    "svc__C": uniform(0.1, 1e4),  # loguniform(1e-3, 1e3),
    "svc__gamma": loguniform(1e-5, 1e3),
}
random_search = RandomizedSearchCV(pipe_svc, 
                  param_distributions = param_dist, 
                  n_iter=10, 
                  n_jobs=-1, 
                  return_train_score=True)

# Carry out the search
random_search.fit(X_train, y_train)
random_search.best_score_
0.9822492602034216

Optimization bias

  • Why do we need separate validation and test datasets?

Mitigating optimization bias

  • Cross-validation
  • Ensembles
  • Regularization and choosing a simpler model

(iClicker) Exercise 8.1

iClicker cloud join link: https://join.iclicker.com/VYFJ

Select all of the following statements which are TRUE.

    1. If you get best results at the edges of your parameter grid, it might be a good idea to adjust the range of values in your parameter grid.
    2. Grid search is guaranteed to find the best hyperparameter values.
    3. It is possible to get different hyperparameters in different runs of RandomizedSearchCV.

Questions for you

  • You have a dataset and you give me 1/10th of it. The dataset given to me is rather small, so I split it into a 96% train and 4% validation split. I carry out hyperparameter optimization using this single 4% validation split and report a validation accuracy of 0.97. Would the model classify the rest of the data with similar accuracy?
    • Probably
    • Probably not

Questions for class discussion

  • Suppose you have 10 hyperparameters, each with 4 possible values. If you run GridSearchCV with this parameter grid, how many cross-validation experiments will be carried out?
  • Suppose you have 10 hyperparameters and each takes 4 values. If you run RandomizedSearchCV with this parameter grid with n_iter=20, how many cross-validation experiments will be carried out?
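One way to check your arithmetic for both questions (assuming sklearn's default of 5 cross-validation folds; the lecture does not specify the fold count):

```python
# GridSearchCV tries every combination; RandomizedSearchCV tries n_iter samples.
# Each combination tried costs one model fit per CV fold.
n_hyperparams, n_values, folds = 10, 4, 5

grid_combos = n_values ** n_hyperparams  # 4^10 = 1,048,576 combinations
grid_fits = grid_combos * folds          # CV experiments for GridSearchCV
random_fits = 20 * folds                 # n_iter=20 for RandomizedSearchCV

print(grid_combos, grid_fits, random_fits)  # 1048576 5242880 100
```

This is why random search scales to large parameter spaces: its cost is fixed by `n_iter`, independent of the grid size.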

Class Demo